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Abstract We develop a mixture procedure for multi-sensor systems to mon¬ 
itor data streams for a change-point that causes a gradual degradation to a 
subset of the streams. Observations are assumed to be initially normal random 
variables with known constant means and variances. After the change-point, 
observations in the subset will have increasing or decreasing means. The subset 
and the rate-of-changes are unknown. Our procedure uses a mixture statistics, 
which assumes that each sensor is affected by the change-point with proba¬ 
bility pq. Analytic expressions are obtained for the average run length (ARL) 
and the expected detection delay (EDD) of the mixture procedure, which are 
demonstrated to be quite accurate numerically. We establish the asymptotic 
optimality of the mixture procedure. Numerical examples demonstrate the 
good performance of the proposed procedure. We also discuss an adaptive 
mixture procedure using empirical Bayes. This paper extends our earlier work 
on detecting an abrupt change-point that causes a mean-shift, by tackling the 
challenges posed by the non-stationarity of the slope-change problem. 

Keywords statistical quality control • change-point detection • intelligent 
systems 


1 Introduction 

As an enabling component for modern intelligent systems, multi-sensory moni¬ 
toring has been widely deployed for large scale systems, such as manufacturing 
systems El: 0: power systems m, and biological and chemical threat detec¬ 
tion systems [5j. The sensors acquire a stream of observations, whose distribu¬ 
tion changes when the state of the network is shifted due to an abnormality 
or threat. We would like to detect the change online as soon as possible after 
it occurs, while controlling the false alarm rate. When the change happens, 
typically only a small subset of sensors are affected by the change, which is a 
form of sparsity. A mixture statistic which utilizes this sparsity structure of 
this problem is presented in m ■ The asymptotic optimality of a related mix¬ 
ture statistic is established in [3) . Extensions and modifications of the mixture 
statistic that lead to optimal detection are considered in [T|. 

In the above references m, the change-point is assumed to cause a shift 
in the means of the observations by the affected sensors, which is good for 
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modeling an abrupt change. However, in many applications above, the change- 
point is an onset of system degradation, which causes a gradual change to the 
sensor observations. Often such a gradual change can be well approximated by 
a slope change in the means of the observations. One such example is shown 
in Fig. |T| where multiple sensors monitor an aircraft engine and each panel of 
figure shows the readings of one sensor. At some time a degradation initiates 
and causes decreasing or increasing in the means of the observations. Another 
example comes from power networks, where there are thousands of sensors 
monitoring hundreds of transformers in the network. We would like to detect 
the onset of any degradation in real-time and predict the residual life time of 
a transformer before it breaks down and causes a major power failure. 

In this paper, we present a mixture procedure that detects a change-point 
causing a slope change to the means of the observations, which can be a 
model for gradual degradations. Assume the observations at each sensor are 
i.i.d. normal random variables with constant means. After the change, obser¬ 
vations at the sensors affected by the change-point become normal distributed 
with increasing or decreasing means. The subset of sensors that are affected 
are unknown. Moreover, their rate-of-clianges are also unknown. Our mixture 
procedure assumes that each sensor is affected with probability po indepen¬ 
dently, which is a guess for the true fraction p of sensors affected. When p 0 
is small, this captures an empirical fact that typically only a small fraction 
of sensors are affected. With such a model, we derive the log-likelihood ra¬ 
tio statistic, which becomes applying a soft-thresholding to the local statistic 
at each sensor and then combining the results. The mixture procedure fires 
an alarm whenever the statistic exceeds a prescribed threshold. We consider 
two versions of the mixture procedure that compute the local sensor statistic 
differently: the mixture CUSUM procedure Tj, which assumes some nominal 
values for the unknown rate-of-change parameters, and the mixture generalized 
likelihood ration (GLR) procedure X 2 , which uses the maximum likelihood es¬ 
timates for these parameters. To characterize the performance of the mixture 
procedure, we present theoretical approximations for two commonly used per¬ 
formance metrics, the average run length (ART) and the expected detection 
delay (EDD). Our approximations are shown to be highly accurate numerically 
and this is useful in choosing a threshold of the procedure. We also establish 
the asymptotic optimality of the mixture procedures. Good performance of 
the mixture procedure is demonstrated via real-data examples, including: (1) 
detecting a change in the trends of financial time series; (2) predicting the life 
of air-craft engines using the Turbofan engine degradation simulation dataset. 

The mixture procedure here can be viewed as an extension of our earlier 
work on multi-sensor mixture procedure for detecting mean shifts [18] . The 
extensions of theoretical approximations to EDD and especially to ARL are 
highly non-trivial, because of the non-i.i.cl. distributions in the slope change 
problem. Moreover, we also establish some new optimality results which were 
omitted from [15], by extending the results in [7] and m to handle non- 
id. d. distributions in our setting. In particular, we generalize the theory to a 
scenario where the log likelihood ratio grows polynomially as a result of linear 
increase or decrease of the mean values, whereas in [T8] , the log-likelihood ratio 
grows linearly. A related recent work [3] studies optimality of the multi-sensor 
mixture procedure for i.i.d. observations, but the results therein do not apply 
to the slope change case here. 

The rest of this paper is organized as follows. Section [2] sets up the for¬ 
malism of the problem. Section [3] presents our mixture procedures for slope 
change detection, and Section[4]presents theoretical approximations to its ARL 
and EDD, which are validated by numerical examples. Section [5] establishes 
the first order asymptotic optimality. Section [6] shows real-data examples. Fi- 
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nally, Section[7]presents an extension of the mixture procedure that adaptively 
chooses po. All proofs are delegated to the appendix. 


2 Assumptions and formulation 

Given A sensors. For the nth sensor n = 1,2,...,A, denote the sequence 
of observations by y n .i, i = 1,2,.... Under the hypothesis of no change, the 
observations at the nth sensor have a known mean y n and a known variance 
<t^. Probability and expectation in this case are denoted by and 
respectively. Alternatively, there exists an unknown change-point that occurs 
at time k, 0 < k < oo, and it affects an unknown subset AC {1,2,...,A} of 
sensors simultaneously. The fraction of affected sensors is given by p = \A\/N. 
For each affected sensor n G A, the mean of the observations y n j changes 
linearly from the change-point time k + 1 and is given by p n + c n (t — n) 
for all t > k, and the variance remains <r^. For each unaffected sensor, the 
distribution stays the same. Here the unknown rate-of-change c n can differ 
across sensors and it can be either positive or negative. The probability and 
expectation in this case are denoted by P;5 and E;f, respectively. In particular, 
k = 0 denotes an immediate change occurring at the initial time. The above 
setting can formulate as the following hypothesis testing problem: 

H 0 ■ Vn,i ~ A f{p n , @n)i * = 1, 2, . . . , 71 = 1, 2, . . . , A, 

H\ . Vn 7 i ~ A/*(/r n , cr„), i — 1, 2,..., /-v, 

Vn,i ^ N (/G T C-n{i ® n) i ^ ^ T 1, n T 2, 

Vn,i rs ' 1 A /"^ 1, 2, . . . , 71 G A . 

Our goal is to establish a stopping rule that stops as soon as possible after a 
change-point occurs and avoids raising false alarms when there is no change. 
We will make these statements more rigorous in Section [f] and Section [5] 
Here, for simplicity, we assume that all sensors are affected by the change 
simultaneously. This ignores the fact that there can be delays across sensors. 
For asynchronous sensors, one possible approach is to adopt the scheme in [4], 
which claims a change-point whenever the any sensor detects a change. We 
plan investigate the issue of delays in our future work. 

A related problem is to detect a change in a linear regression model. One 
such example is a change-point in the trend of the stock price illustrated in 
Fig. [7j a). This can be casted into a slope change detection problem, if we fit 
a linear regression model under H 0 (e.g., using historical data) and subtract 
it from the sequence. The residuals after the subtraction will have zero means 
before the change-point, and their means will increase or decrease linearly after 
the change-point. 

Remark 1 (Reason for not differencing the signal.) One legitimate question 
is that why not de-trending the signal at each sensor by difference, which 
may turn the slope change into a mean change problem and we can apply the 
standard CUSUM procedure designed for detecting the mean shift. Indeed, 
for the affected sensors after the change-point, E[?/ nj ,; + i — y n ^\ = c n . However, 
differencing will also increase the variance, as Var[y nj j + i —y n ,i\ = 2crHence, 
differencing reduces the signal-to-noise ratio and this is particularly bad for 
weak signals and makes them even non-detectable. This is validated by real 
data as well. Consider the engine data displayed in Fig. |T| The first panel in 
Fig. [^corresponds to observations of one sensor that is affected by noise, which 
clearly has the “signal” as the mean is increasing. However, after we difference 
the signal, the change is almost invisible, as illustrated. We then try applying 
CUSUM on the difference, where the statistic either rises slowly which means 
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Fig. 1 Degradation sample paths recorded by 21 sensors, generated by C-MAPSS Hu- 
A subset of sensors are affected by the change-point, which happens at an unknown time 
simultaneously and it causes a change in the slopes of the signals. The change can cause 
either an increase or decrease in the means. 


it cannot detect quickly. We also try applying CUSUM (designed for mean 
change detection) directly on the original signal. Although the statistics rises 
reasonably fast; however, clearly this statistic cannot estimate the change- 
point location accurately due to its model mismatch. The last panel in Fig. [2] 
shows our proposed statistic, which can detect the change fairly quickly, and 
it can also accurately estimate the change-point location. 



Fig. 2 Engine data displayed in Fig. |lj the first panel corresponds to observations of one 
sensor that is affected by noise, which clearly has the “signal” as the mean is increasing. 
However, after we difference the signal, the change is almost invisible. Then we compare ap¬ 
plying CUSUM procedure (designed for mean shift) on the original signal, on the difference, 
and applying our proposed statistic on the original signal. 


3 Detection procedures 

Since the observations are independent, for an assumed change-point location 
k = k and an affected sensor n £ A, the log-likelihood for observations up to 
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time t > k is given by 


£ n ^n) 


1 x ^ 

TT - ^ 'y ] [2c n (2/n,i — Mn)(^ — fc) — c n (^ — &) ] • 

2a - i=1ti 


( 2 ) 


Motived by the mixture procedure in [lftj and na to exploit an empirical fact 
that typically only a subset of sensors are affected by the change-point, we 
assume that each sensor is affected with probability p 0 £ (0,1] independently. 
In this setting, the log likelihood of all N sensors is given by 

N 

^2 lo S (! - Po + Po exp [e n (k, t, c„)]). (3) 

n =1 

Using ([3]), we may derive several change-point detection rules. 

Since the rate-of-change c n is unknown, One possibility is to set c n equal 
to some nominal post-change value S n and define the stopping rule, referred 
to as the mixture CUSUM procedure: 


N 


T\ = inf < t : max V' log (1 - p 0 + Po exp [t n (k, t, $„)]) > b 

■ 0 <k<t z ' I 


n—1 


( 4 ) 


where b is a threshold typically prescribed to satisfy the average run length 
(ARL) requirement (formal definition of ARL is given in Section [4]). 

Another possibility is to replace c n by its maximum likelihood estimator. 
Given the current number of observations t and a putative change-point loca¬ 
tion k, by setting the derivative of the log likelihood function © to 0, we may 
solve for the maximum likelihood estimator: 


c n (k,t) 


~ Pn) 

£U+i(i-*0 2 


( 5 ) 


Define t = t — k to be the number of samples after the change-point k. Denote 
the sum of squares from 1 to r, and the weighted sum of data as, respectively, 

T t 

— y ^ i , bl ^n,k,t — y ^ ^0(2/n,i Mn)/^n* 

i—1 i=k-\-l 

Let 


u n , k ,t=(A T )- 1/2 w nAt . (6) 

Substitution of <[5j) into ([ 2 ]) gives the log generalized likelihood ratio (GLR) 
statistic at each sensor: 

£ n (k,t,c n ) = Ul k J 2, (7) 

and we define the mixture GLR procedure as 


T 2 = inf 


N 


t : max 
0 <fc<i 


]T log (1 


n—1 


Po +p 0 exp [Ul^J 2]) 



( 8 ) 


where b is a prescribed threshold. 
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Fig. 3 Matched filter interpretation of the generalized likelihood ratio statistic at each 
sensor U n ^,t = Ay ' JZi=fe+i(® — tyiVn.i ~ /in)/ Gn- data at each sensor is matched with a 
triangle-shaped signal that starts at a hypothesized change-point time k and ends at the 
current time t. The slope of the triangle is AA ] , so that the norm of the triangle signal 
is one. 

Remark 2 (Window limited procedures.) In the following we use window lim¬ 
ited versions of T\ and T 2l where the maximum for the statistic is restricted 
to a window t~w<k<t~w' for suitable choices of window size w and 
w'. In the following, we use T to denote a window-limited version of a pro¬ 
cedure T. By searching only over a window of the past w — w' + 1 samples, 
this reduces the memory requirements to implement the stopping rule, and it 
also sets a minimum level of change that we want to detect. The choice of w 
may depend on b and sometimes we need make additional assumptions on w 
for the purpose of establishing the asymptotic results below. More discussions 
about the choice of w can be found in |B] and jT]. The other parameter w' 
is the minimum number of observations needed for computing the maximum 
likelihood estimator for parameters. In the following, we set w' = 1. 


Remark 3 (Relation to mean shift.) For the mean-shift multi-sensor change- 
point detection [T8] , the detection statistic depends on a key quantify, which is 
the average of the samples in the time window [k +1, t]. Note that in the slope 
change case, the detection statistic has a similar structure, except that the 
key quantity is replaced by a weighted average of the samples in the window: 
(t — k)~ l t 2 X^=fc+i(2/n,* — (J"n)/o’n- This has an interpretation of “matched 
filtering”, as illustrated in Fig. [3} each data stream is matched with a triangle 
shaped signal starting at a potential change-point time k that represents a 
possible slope change. 

Remark 4 (Recursive computation.) The quantity W Ut k,t involved in the de¬ 
tection statistic for ^ can be calculated recursively, 

W^ n ,k,t+1 = W^ n .k } t T (t 1 k ) ((?/n,i-(-1 Mn)/ ®n) ) 

where IT n ,t,t — 0. This facilitates online implementation of the detection pro¬ 
cedure. The quantity A T can be pre-computed since it is data-independent. 

Remark 5 (Extension to correlated sensors.) The mixture procedure <[8j) can 
be easily extended to the case where sensors are correlated with a known co- 
variance matrix. Define a vector of observations y i = [yi,i ,... ,yN,i] J f° r all 
sensors at time i. When there is no change, y i follows a normal distribution 
with a mean vector p = [p 1 ,..., pn] t and a covariance matrix Alterna¬ 
tively, there may exist a change-point at time k such that after the change, the 
observation vectors are normally distributed with mean vector p + (i — k)c, 
c = [ci,..., c n ] J and the covariance matrix remains Eq for all i > k. We can 
whiten the signal vector by y i = E 0 1 ^ 2 (y i — p), where E 0 1 ^ 2 is the square- 
root of the positive definite covariance matrix that may be computed via its 
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eigen-decomposition. The coordinates of y, are independent and the problem 
then becomes the original hypothesis testing problem 0 with all sensors be¬ 
ing affected simultaneously by the change-point, the rate-of-change vector is 
S Q c, the mean vector is zero before the change, and the covariance remains 
an identity matrix before and after the change. Hence, after the transform, we 
may apply the mixture procedure with po = 1 on y.;. 


4 Theoretical properties of the detection procedures 


In this section we develop theoretical properties of the mixture procedure. We 
use two standard performance metrics (1) the expected value of the stopping 
time when there is no change, the average run length (ARL); (2) the expected 
detection delay (EDD), defined to be the expected stopping time in the ex¬ 
treme case where a change occurs immediately at n = 0. Since the observations 
are i.i.d. under the null, the EDD provides an upper bound on the expected 
delay after a change-point until detection occurs when the change occurs later 
in the sequence of observations (this is also a commonly used fact in change- 
point detection work [T8] ). An efficient detection procedure should have a large 
ARL and meanwhile a small EDD. Our approximation to the ARL is shown 
below to be accurate. In practice, we usually fix ARL to be a large constant, 
and set the threshold b in 0 accordingly. The accurate approximation here 
can be used to find the threshold analytically. Approximation for EDD shows 
its dependence on a quantity that plays a role of the Kullback-Leibler (KL) 
divergence, which links to the optimality results in Section [5] 

Average run length (ARL). We present an accurate approximation for 
ARL of a window limited version of the stopping rule in ([8]) , which we denote 
as T 2 . Let 

g(x) = log(l -po+po exp(x 2 /2)), (9) 

and 

= logE{exp[0 5 (Z)]}, 

where Z has a standard normal distribution. Also let 

7 (0) = ^0 2 E {[g(Z)\ 2 exp [0g{Z) - if, {9)]} , 

and 

H{N,9) = /T exp{iV[0^(0) - V>(0)]|, 

where the dot / and double-dot / denote the first-order and second-order 
derivatives of a function /, respectively. Denote by 4>(x) and <P(x) the stan¬ 
dard normal density function and its distribution function, respectively. Also 
define a special function v(x) = 2x~ 2 exp[— 2 n~ 1( P(— \x\n 1 / 2 /2)\. For 

numerical purposes an accurate approximation is given by m 

~ (2/xmx/2)-l/2] 

(x/2)$(x/2) + fi(x/2 )' 


Theorem 1 (ARL of T 2 .) Assume that N —oo and b —> oo with b/N fixed. 
Let 0 be defined by ip(0) = b/N. For a window limited stopping rule of 0) with 
w = o(b r ) for some positive integer r, we have 


Eoc{T 2 } = R(A,0). 


ryJ^N/ (4/3) 1/2 
/ v / 2Af/(4tu/3) 1 /2 


yv 2 (vV'y (°))dy 


o(l). (10) 
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The proof of Theorem [T] is an extension of the proofs in [T8] and m using 
the change of measure techniques. To illustrate the accuracy of approximation 
given in Theorem [l] we perform 500 Monte Carlo trials with p 0 = 0.3, and 
w = 200. Figs. |4]ja) and (b) compare the simulated and theoretical approxi¬ 
mation of ARL given in Theorem [l] when N = 100 and N = 200, respectively. 
Note that expression ([To]) takes a similar form as the ARL approximation 
obtained in (l8j for the multi-sensor mean-shift case, and only differs in the 
upper and lower limits in the integration. In Figs.^a) and (b) we also plot the 
approximate ARL for the mean shift case in [18] , which shows the importance 
of having the corrected integration upper and lower limits in our approxima¬ 
tion. In practice, ARL is usually set to 5000 and 10000. Table [I] compares 
the thresholds obtained theoretically and from simulation at these two ARL 
levels, which demonstrates the accuracy of our approximation. 




(a) (b) 

Fig. 4 (a) Comparison of theoretical and simulated ARL when (a): N = 100, po = 0.3, and 
w = 200; (b): N = 200, po = 0.3, and w = 200. 


Table 1 Theoretical versus simulated thresholds for po = 0.3, N = 100 or 200, and w = 200. 



ARL 

Theory b 

Simulated ARL 

Simulated b 

N = 100 

5000 

46.34 

5024 

46.31 


10000 

47.64 

10037 

47.60 

N = 200 

5000 

77.04 

5035 

76.89 


10000 

78.66 

10058 

78.59 


Expected detection delay (EDD). After a change-point occurs, we are 
interested in the expected number of additional observations required for de¬ 
tection. In this section we establish an approximation upper bound to the 
expected detection delay. Define a quantity 


A = 



( 11 ) 


which roughly captures the total signal-to-noise ratio of all affected sensors. 

Theorem 2 (EDD of X 2 .) Suppose b —► 00 , with other parameters held fixed. 
Let U be a standard normal random variable. If the window length w is suffi¬ 
ciently large and greater than (6b/ A 2 ) 1 / 3 , then 

r ~ ^ r b - \A\ logpo - (N - \A\)K{g(U)} \ 1/3 
E ° {T2} - 1 -Z76-/ 


+ o(l), (12) 
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where Eq 1 is defined at the beginning of Section]^ To demonstrate the accuracy 
of (12), we perform 500 Monte Carlo trials. In each trial, we let the change- 
point happen at the initial time and randomly select Np sensors affected by 
the change and set the rate-of-change c n = c for a constant c, n £ A. The 
thresholds for each procedure are set so that their ARLs are equal to 5000. 
Fig.[5]shows EDD versus c, where our upper bound turns out to be an accurate 
approximation to EDD. 



Fig. 5 Comparison of theoretical and simulated EDD when N = 100, po = 0.3, p = 0.3, 
and w = 200. All rate-of-change c n = c for affected sensors. 


5 Optimality 

In this section^we prove that our detection procedures: T\ and the window lim¬ 
ited versions T\ and T 2 are asymptotically first order optimal. The optimality 
proofs here extends the results in m , h, for our multi-sensor non-i.i.d. data 
setting. The non-i.i.d. ness is due to the fact that under the alternative, the 
means of the samples change linearly as the number of post-change samples 
grows. Following the classic setup, we consider a class of detection procedures 
with their ARL greater than some constant 7, and then find an optimal pro¬ 
cedure within such a class to minimize the detection delay. Since it is difficult 
to establish an uniformly optimal procedure for any given 7, we consider the 
asymptotic optimality when 7 tends to infinity. 

We first study a general setup with non-?'.fid. distributions for the multi¬ 
sensor problem, and establish optimality of two general procedures related to 
T\ and X 2 . Then we specialize the results to the multi-sensor slope-change 
detection problem. In particular, we generalize the lower bound for the detec¬ 
tion delay from the single sensor case (Theorem 8.2.2 in m and Theorem 1 
in 0) to our multi-sensor case. We also generalize the result therein to our 
setting where the log-likelihood ratio grows polynomially on the order of j q 
for q > 1 as the number of post-change observations j grows (in the classic 
setting q = 1); this is used to account for the non-stationarity in our problem. 

Setup for general non-i.i.d. case. Consider a setup for the multi-sensor 
problem with non-?.?.d. data. Assume there are N sensors that are indepen¬ 
dent (or with known covariance matrix so the observations can be whitened 
across sensors), and that the change-point affects all sensors simultaneously. 
Observations at the nth sensor are denoted by x n j, over time t = 1,2,.... 
If there is no change, x n , t are distributed according to conditional densities 
fnA x n,t\ x n,[i,t-i]), where = (a;„,i,..., (this allows the dis¬ 

tributions at time t to be dependent on the previous observations). Alter¬ 
natively, if a change-point occurs at time n and the nth sensor is affected, 
x n j are distributed according to conditional densities f n ,t{ x n,t\ x n,[i,t-i]) f° r 
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t = 1,...,«, and are according to gl?l{ x n,t l^nji.t-il) f° r t > k. Note that 
the post-change densities are allowed to be dependent on the change-point k. 
Define a filtration at time t by Ft = cf(xi [ l t ],... ,a;jv,[i,i])- Again, assume a 
subset A Cjl, 2,..., N} of sensors are affected by the change-point. Similar 
to Section [2j with a slight abuse of notation, we denote , P;f and 

as the probability and expectation when there is no change, or when a 
change occurs at time n and a set A of sensors are affected by the change, 
with the understanding that here the probability measures are defined using 
the conditional densities. 

Optimality criteria. We adopt two commonly used minimax criteria to es¬ 
tablish the optimality of a detection procedure T. Similar to Chapter 8.2.5 
of (15J . we consider two criterions associated with the m-th moment of the 
detection delay for m > 1. The first criterion is motivated by Lorden’s work 
[9j, which minimizes the worst-case delay 

ESM£(T) = sup esssupE^{[(T-fc)+] m |J- fc }, (13) 

0<fc<oo 


where “esssup” denotes the measure theoretic supremum that excluded points 


of measure zero. In other words, the definition (131 first maximizes over all 


possible trajectories of observations up to the change-point and then over the 
change-point time. The second criterion is motivated by Poliak’s work m, 
which minimizes the maximal conditional average detection delay 


SM„(T) = sup Ef{(T-k) m \T> k}. 

0<k<oo 


(14) 


The extended Poliak’s criterion (fT4|) is not as strict as the extended Lorden’s 


criterion in the sense that SM^(T) < ESM^(T), and we prefer (141 since it is 


connected to the conventional decision theoretic approach and the resulted op¬ 
timization problem can possibly be solved by a least favorable prior approach. 
The EDD defined earlier in Section [4] can be viewed as ESM m and SM m for 
to = 1 , and the supremum over k happens when k = 0 . 

Define C(j) to be a class of detection procedures with their ARL greater 
than 7 : 

C( 7 ) 4 {T : EJT} > 7 }. 

A procedure T is optimal, if it belongs to C( 7 ) and minimizes ESM m (T) or 
SM m (T). 

Optimality for general non-Li .d setup. Under the above assumptions, the 
log-likelihood ratio for each sensor is given by 


t 

E 

i=k -\-1 


9n,i [l,i—1]) 

fn,i (%n,i \%n,[l,i— 1]) 




For any set A of affected sensors, the log-likelihood ratio is given by 


^ A,k,t — ^ ^ ^ n,k,t • 
neA 


(15) 


We first establish an lower bound for any detection procedure. The constant 
Iy i below can be understood intuitively as a surrogate for the Kullback-Leibler 
(KL) divergence in the hypothesis problem. When the observations are i.i.d ., 
I A is precisely the KL divergence [7]. 
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Theorem 3 (General lower bound.) For any A C {1 such that 

there exists some q > 1 , j~ q \A,k,k+j converges in probability to a positive 
constant 7 4 £ (0,00) under P/ 1 , 


—~^A,k,k+j —► I A, ( 16 ) 

J " J—^oo 

and in addition, for all e > 0 , for an arbitrary M —> 00 
sup esssup P^ < A/ -9 max A A,k,k+j > (1 + £)Ia 

0<k<oo ( 0<j<M 


Tk 


M—> 00 


> 0 . ( 17 ) 


Then, 

(i) for all 0 < e < 1 , there exists some k > 0 such that 


lim sup P k \k <T < k + (1 

7->oo TeC(7) 1 


£ )(^ 1 l°g7)' 


T > k \ = 0 . 


( 18 ) 


(ii) for all to > 1 , 


lim . nf inf T6C(T , E 5<,(T) >1|minf inf T6CM 5i<(r) ? _1^ (w) 


7—»-oo 


(log 7 ) m / q 


7—>-oo 


(log 7) m /9 


r">/? ' 


Consider a general mixture CUSUM procedure related to T), which has 
also been studied in El and [TS] : 


T C s = inf < t 


may 


N 


( 20 ) 


where & is a prescribed threshold. The following lemma shows that for an 
appropriate choice of the threshold b , Tq s has an ARL lower bounded by 7 
and, hence, for such thresholds it belongs to C{ 7). 

Lemma 1 For any p 0 £ ( 0 , 1 ], Tcs(b) £ C( 7), provided b > log7. 


Theorem 4 (Optimality of Tcs-) -For any dC {1,..., N} such that there 
exists some q > 1 and a finite positive number I a £ (0, 00) for which ( 17 | i 
holds, and for all e £ (0,1) and t > 0, 

sup esssnp (, j~ q \A,k,k+j < Ia{ 1 - e)!^) -> 0 . ( 21 ) 

0 <k<t t->oo 


If b > log7 and b = G(logy), then Tcs asymptotically minimax in the class 
(7(7) in the sense of minimizing ESM^(T) and SM^(T) for all to > 1 to the 
first order as 7 —»• 00. 


We can also prove that the window-limited version Tcs is asymptotically 
optimal. Since the window length affects ARL and the detection delay, in the 
following we denote this dependence more explicitly by w 1 . 


Corollary 1 (Optimality of Tcs-) Assume the conditions in 
hold and in addition, 

.. . . w 7 

limml- - —77- > 1. 

7^00 ( logy //*) 1 / 9 


Theorem 


a 


( 22 ) 


If b > logy and b = O(logy), then Tcs{b) is asymptotically minimax in the 
class C( 7) in the sense of minimizing ESM^(T) and SM^(T) for all m > 1 
to the first order as 7 —► 00. 
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Intuitively, this means that the window length should be greater than the 
first order approximation to the detection delay [(logy)//^] 1 ' 7 ' 3 . Note that our 
earlier result ( 12 ) for the expected detection delay of the multi-sensor case is 
of this form for q — 3 and I a = A 2 /6. 

Similarly, we may consider a general mixture GLR procedure related to I 2 
as in [18]. Denote the log-likelihood |I5| ) as \A,k,t.(0) to emphasize its depen¬ 
dence on an unknown parameter 6. The mixture GLR procedure maximizes 
9 over a parameter space 0 before combining them across all sensors. Unfor¬ 
tunately, we are unable to establish the asymptotic optimality for the general 
GLR procedure and its window limited version, due to a lack of martingale 
property. 


Optimality for multi-sensor slope change. Note that T\ and T\ corre¬ 
spond to special cases of Tq s, 7cs, so we can use Theorem |4j and Corollary 
[T] to show their optimality by checking conditions. Although we are not able 
to establish optimality of the general mixture GLR procedure as mentioned 
above, we can prove the optimality for T 2 by exploiting the structure of the 
problem. 


Lemma 2 (Lower bound.) For the multi-sensor slope change detection prob¬ 
lem in (0; for a non-empty set A C. {1,, N}, the conditions of Theorem [| 
are satisfied when q = 3 and I a = A 2 / 6 . 

The following lemma plays a similar role as the general version Lemma [l] in 
our multi-sensor case in (|T|) , and it shows that for a properly chosen threshold 
b , ARL of T 2 is lower bounded by 7 and, hence, for such threshold it belongs 
to C( 7 ). 

Lemma 3 For any p 0 £ (0,1], T 2 (b) £ provided 


b > N/ 2 — 4 log 


1 - (1 - 1/7 


Remark 6 (Implication on window length.) Lemma [3] shows that to have b = 
0 (log 7 ), we need logu ; 7 = o(log 7 ). 


Theorem 5 (Asymptotical optimality of Ti, T) and T 2 .) Consider the 
multi-sensor slope change detection problem ©• 

(i) If b > log 7 and b = ©(logy), then Ti(b) is asymptotically minimax in 
class C( 7 ) in the sense of minimizing expected moments ESM^iT) and 
SM^(T) for all m > 1 to the first order as 7 — >• 00 . 

(ii) In addition to conditions in (i), if the window length satisfies 


liminf- — > 1 , 

T ^ 00 [ 6 (log 7 )/AX 2 ] 1/3 


(23) 


then Ti(b) is asymptotically minimax in class C( 7 ) in the sense of min¬ 
imizing expected moments ESM^IT) and SM^(T) for all m > 1 to first 
order as 7 —> 00 . 

(Hi) If b > N/2 — 41og[l — (1 — l/ 7 ) 1 ^ u ' 7 ], b = 0 (log 7 ), the window length 
satisfies log(rc 7 ) = o(log 7 ) and (23) holds, then T 2 (b) is asymptotically 
minimax in class C( 7 ) in the sense of minimizing ESM^(T) and SM^(T) 
for m = 1 to first order as 7 — > 00 . 


Remark 7 Above we prove the optimality of 7\(6) and T\(b) for m > 1. How¬ 
ever, we can only prove the optimality of T 2 {b) for a special case m = 1, due 
to a lack of martingale properties here. 
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6 Numerical Examples 


Comparison with mean-shift GLR procedures. We compare the mixture 
procedure for slope change detection, with the classic multivariate CUSUM 
H3 and the mixture procedure for mean shift detection |18| . The multivariate 
CUSUM essentially forms a CUSUM statistic at each sensor, and raises an 
alarm whenever a single sensor statistic hits the threshold. As commented 
earlier in Remark [ 3 J the only difference between Ti and the mixture procedure 
for mean shift in |TBj is how U nt k,t is defined. Following the steps for deriving 


(32), we can show that the mean shift mixture procedure is also asymptotically 


optimal for the slope change detection problem. Here, our numerical example 
verifies this, and show that the improvement of EDD by using X 2 versus the 
multi-variate CUSUM and the mean-shift mixture procedure is not significant. 
However, the mean-shift mixture procedure fails to estimate the change-point 
time accurately due to model mismatch. Fig.[6]shows the mean square error for 
estimating the change-point time k, using the multi-chart CUSUM, the mean- 
shift mixture procedure, and T 2 , respectively. Note that T 2 has a significant 
improvement. 



Fig. 6 Comparison of mean square error for estimating the change-point time for the mix¬ 
ture procedure tailored to slope change T 2 , the mixture procedure with mean shift, and 
multi-chart CUSUM, when N = 100, po = 0.3, p = 0.5 and w = 200. 


Financial time series. In the earlier example illustrated in Fig. (T^a), the 
goal is to detect a trend change online. Clearly a change-point occurs at time 
8000 in the stock price, and such a change-point is verifiable. Fig. shows 
that there is a peak in the bid size versus the ask size, which usually indicates 
a change in the trend of the price (possible with some delay). To illustrate 
the performance of our method in this financial dataset, we plot the detection 
statistics by using a “single-sensor”, i.e., using only one data stream, and by 
using “multi-sensor” scheme, i.e. using data from multiple streams, which in 
this case correspond to 8 factors (e.g, stock price, total volume, bid size and 
bid price, as well ask size and ask price). In fact, only 4 factors out of 8 
factors contain the change-point. Fig. [7](c) plots the statistic if we use only a 
single-sensor. Fig. [7](d) illustrates the statistic when we use all the 8 factors 
and preprocess by whitening with the covariance of the factors as described 
in Section [H The statistics all rise around time 8000 with the multi-sensor 
statistic to be smoother and indicates a lower false detection rate. Looking at 
Fig.[7[d), after the major trend change (around sample index 8000), the multi¬ 
chart CUSUM statistic rises the slowest. Although it appears, the slope-change 
mixture procedure rises a bit slower than the mean-shift mixture procedure, 
we demonstrate in simulation that for fixed ARL these two procedures have 
similar EDDs, and also in Fig. 5 that the slope-change mixture procedure has 
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a better performance in estimating k* than the mean-shift mixture procedure. 
Therefore, the slope-change mixture procedure is still preferrable. 



(a) Stock price data. 



x 10 4 


(c) Single-sensor. 



(e) Multi-sensor, correlation. 


350 
300 
0250 
1 200 



(b) ask-size/bid-size 



(d) Zoom-in of (c). 



(f) Zoom-in of (e). 


Fig. 7 Statistic for detecting trend changes in financial time series with w = 500 for both 
single sensor and multi-sensor procedures, and po = 1 for the multi-sensor procedure. 


Aircraft engine multi-sensor prognostic. We present an engine prognostic 
example using the aircraft turbofan engine dataset simulated by NASif] In the 
dataset, multiple sensors measure different physical properties of the aircraft 
engine to detect a faulty condition and to predict the whole life time. The 
dataset contains 100 training systems and 100 testing systems. Each system 
is monitored by N = 21 sensors. In the training dataset, we have a complete 
sample path from the initial time to failure for each of the 21 sensors of each 
training system. In the testing dataset, we only have partial sample paths (i.e., 
the system fails eventually but we have not observed that yet and it still has a 
remaining life). Our goal is to predict the whole life for the test systems using 
available observations. The dataset also provides ground truth, i.e., the actual 
failure times (or equivalently the whole life) of the testing systems. 

We first apply our mixture procedures to each training system j. j = 
1,..., 100, to estimate a change-point location K,j (which corresponds to the 
maximizer of k in the definition of X 2 when the procedure stops), and the 
rate-of-change at nth sensor for the jth system c nj using ([ 5 ]). Then fit a 
simple survival model using kj and c n j as regressors in determining the re¬ 
maining life. We build a model for the Time-To-Failure (TTF) Y) of system 
j based a log location-normal model, which is commonly used in reliability 
theory [2]: P {Yj < y} = $ [(log(y) — ^j)/v\ > where 77 is a user specified scale 


1 Data can be downloaded from http://ti.arc.nasa.gov/tech/dash/pcoe/prognostic-data- 
repository/ 
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parameter that is assumed to be the same for each system, 7 Tj is the loca¬ 
tion parameter that is assumed to be a linear function of the rate-of-change: 
7 Tj = /3 q + PjCn,j> where (fio, Pi, ■ ■ •, Pn) is a vector of the regression 

coefficients that are estimated by maximum likelihood. Next, we apply the 
mixture procedure on the jth testing system to estimate the change-point 
time kj and the rate-of-change c n j, and substitute them into the fitted mod¬ 
els to determine a TTF using the mean value. The whole life of the jth system 
is estimated as k 3 plus its mean TTF. 

We use the relative prediction error as performance metric, which is the 
absolute difference between the estimated life and the actual whole life, divided 
by the actual whole life. Fig. [8] shows the box-plot of the relative prediction 
error versus threshold b. Our method based on change-point detection works 
well and it has a mean relative prediction error around 10%. Here the choice 
of the threshold b has a tradeoff: the relative prediction error decreases with 
a larger b ; however, a larger b also causes a longer detection delay. 


0.5 
0.45 
0.4 
0.35 
t 0.3 
§ 0.25 
=6 0.2 
£ 0.15 
0.1 
0.05 
0 


Fig. 8 Aircraft engine prognostic example: box-plot for relative prediction error of the 
estimated life time of the engine versus threshold b. 



7 Discussion: adaptive choice of po 

The mixture procedure assumes that a fraction po of the sensors are affected by 
the change. In practice, po can be different from p which is the actual fraction 
of sensors affected. The performance of the procedure is fairly robust to the 
choice of p 0 • Fig. [9] compares the simulated EDD of a mixture procedure with 
a fixed po value, versus a mixture procedure when setting p 0 = p if we know 
the true fraction of affected sensors. Again, thresholds are chosen such that 
ARL for all cases are 5000. Note that the detection delay is the smallest if po 
matches p; however, EDD in these two settings are fairly close when po 7 ^ P- 
Still, we may improve the performance of the mixture procedure by adapt¬ 
ing the parameter p 0 using a method based on empirical Bayes. Assume each 
sensor is affected with probability po, but now po itself is a random variable 
with Beta distribution Beta(a,/3). This also allows the probability of being 
affected to be different at each sensor. With sequential data, we may update 
by computing a posterior distribution of po using data in the following way. 
Choosing a constant a, we believe that the nth sensor is likely to be affected 
by the change-point if U n k,t is larger than a. Let !{•} denote an indicator 
function. For each t, assume s Ut t = I {ma,xt- w <k<tU ni k,t > a} is a Bernoulli 
random variable with parameter p n . Due to conjugacy, the posterior of po at 
the nth sensor, given s njt up to time t , is also a Beta distribution with param¬ 
eters Beta (s nt t + a, 1 — s n ,t + /?). An adaptive mixture procedure can be formed 
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p 

Fig. 9 Simulated EDD for a mixture procedure with po set to a fixed value, versus a mixture 
procedure with po = p equal to the true fraction of affected sensors, when c n = 0.1, N = 100 
and w = 200. 


using the posterior mean of po, which is given by p n = (s n ,t + a)/(a + 0 + 1): 


N 


T, 


adaptive 


= inf < t : 


max 

t—w<k<.t 


lo s0 - - 


exp(t/^ M /2)) > b 


(24) 


where b is a prescribed threshold. 

We compare the performance of T ac j ap tive with its non-adaptive counterpart 
T 2 by numerical simulations. Assume N = 100 and there are 10 sensors affected 
from the initial time with a rate-of-change c n = c. The parameters for T a d ap tive 
are a = 1, /? = 1 and a = 2. Again, the thresholds are set so that the simulated 
ARL for both procedures are 5000. Table [ 7 ] shows that T ac j ap tive has a much 
smaller EDD than T 2 when signal is weak with a relative improvement around 
20%. However, it is more difficult to analyze ARL of the adaptive method 
theoretically. 


Table 2 Comparing EDD of T 2 and T a daptive- 


Rate-of-change 

0.01 

0.03 

0.05 

0.07 

0.09 

Non-Adaptive T 2 

54.15 

26.24 

18.75 

14.98 

12.74 

Adaptive T a( japtive 

38.56 

20.28 

14.42 

12.17 

10.13 
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A An informal derivation of Theorem |TJ ARL 


We first obtain an approximation to the probability that the stopping time is greater than 
some big constant m. Such an approximation is obtained using a general method for com¬ 
puting first passing probabilities first introduced in na and developed in m- The method 
relies on measure transformations that shift the distribution of each sensor over a window 
that contains the hypothesized post-change samples. More technical details to make the 
proofs more rigorous are omitted. These details have been described and proved in m ■ 

In the following, let t = t — k. Define the log moment-generating-function -i/> r (0) = 
logEexp{^(f/ n) fc ) t)}. Recall that U n ,h,t is a generic standardized sum over all observations 
within a window of size r in one sensor, and the parameter 6 = 0 T is selected by solving the 
equation 

^ r (6>) = b/N. 

Since U n ,k,t is a standardized weighted sum of r independent random variables, -i/v converges 
to a limit as r —»■ oo, and 0 T converges to a limiting value. We denote this limiting value by 
0 . 

Denote the density function under the null as P. The transformed distribution for all 
sequences at a fixed current time t and at a hypothesized change-point time k (and hence 
there are r hypothesized post-change samples) is denoted by P^ and is defined via 


dPj = exp 


N 

6 r Y, 9 (U n ,k,t) - 

n=1 


dP. 


Let 


Let the region 


e N ,k,t = log^/dP) = e T J 2 g(u n ,k,t) - 


D = {(£, k) : 0 < t < to,l < t — k < w} 
be the set of all possible change-point times and time up to a horizon m. Let 


A = \ max V' g{U ntkit ) > b i . 
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be the event of interest. Hence, we have 


P{A} = E 

( t,k)eD 


E 


( t',k')eD f 


777 ’ A \= E e M E 


— ^ * e — N(6 T b— iIjt(Ot)) ^ jgi 


( t,k)eD 


r Mn, 

l Sjsr,k,t 


( t,k)£D 
k,t l 


( t',k')eD 


■A 


e -tN,k,t -log M N,k,t-£ N _|_ log Afiv.fc.t > 0 


(25) 


where 


n=1 

S N ,k,t= E e E "= 1 ' Ms(c/ ".'' , .‘ ,)_s(t/ ’ 

( t',k')eD 

p S»=i 8T[s(tr nifc , tt ,)- s (i7 n ,*, t )] 


)] 


Af/v k t — max 
’ ’ ( t',k')eD 

As explained in he under certain verifiable assumptions, a “localization lemma” allows 
simplifying the quantities of the form in I into a much simpler expression of the form 

where ctn,t is the PJ standard deviation of £jy and E [M/S\ is the limit of E{Mjv k t/£>N k t} 
as N —> oo. This reduction relies on the fact that, for large N and m, the “local” processes 
M n fc t and Sjy k t are approximately independent of the “global” process l n . This allows 
the expectation to be decomposed into the expectation of Mn/Sn times the expectation 
involving £jy + log Mn, treating log Mn as a constant. 

Let t' = t' — k ', and denote by z n i = (y n ,i — ^n)/^n which are i.i.d. normal random 
variables, i = 1,2,.... Note that, use Taylor expansion up to the first order, we obtain 
N N 

^ ^ ^r[g(Un,k' ,t') ~ 9(Un,k,t)] ~ ^ ^ @Tg{Un,k,t) \Un,k' ,t' ~ 

71 = 1 71 = 1 


= E e r9{U n ^t)[A T } ,2 W nry - A t 1/2 W n , k , it , + 

71 = 1 

N 

= E 


4-1/2 


W, 


n,k' ,t‘ 


— a t 1/2 



(26) 


Note that in the above expression, the first term has two weighted data sequences running 
backwards from t' and when r and t' both tends to infinity they tend to cancel with each 
other. Hence, asymptotically we need to consider the second term. Observe that one may let 
t' — k! = t and 6 = lim r _>.oo 0 T for 0 T in the definition of the increments and still maintain 
the required level of accuracy. When r = u the first term in the above expression, and 
the second term consists of two terms that are highly correlated. The second term can be 
rewritten as 


- 1/2 


6 r 


^ ^ g(J^n,k,t')^ r n,k' ,t' ^ ^ gij^n' ,k,t)^^n' ,k,t 


(27) 


Since all sensors are assumed to be independent (or has been whitened by a known covariance 
matrix so the transformed coordinates are independent), so the covariance between the two 
terms is given by 

/ N N \ N 

Cov ['52g(jUn,h,t)W n , h ,t, E 9(£4',m)WV,m = J2ti( U n',k,t)] 2 Cov(W niktt ,W n ,, k> ' t ,). 


(28) 

For each n, let k < k' < t < t' and t — k = t' — k! = r, and define u — k' — k and 
s — t — k r . We have 


A T 1 Cov(W niktt ,W n ^ k ^ t >) =E 


=E 


(5E=fc + l(* fc) 2 n,i) (Ei = fc' + l(* k')z n j^ 


EL 


E\= k '+i (® - m - k')zl .) _ J2t =1 i 2 + «EU < 


Er=i< 


ELi * 2 
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By choosing u = s/r, we know that the expression above is approximately on the order of 

( k' — k) + ( t' — t) ( k' — k) + ( t' — t ) 

2 (§ r2 + 5 r ) ~ 1 I t2 

Let rj = | t 2 . Hence, by summarizing the derivations above and applying the law of large 
number, we have that when N —>■ oo and r —¥ oo, the covariance between the two terms 
become 


/ N N \ 11 

c °v ^ 0 T g(U n k ' t i), ^2 d rg(U n f k t ) 6» 2 iV ■ [1-(fc' - fc)-(t' -1)]. 

V^I JTi / V V 

This shows that the two-dimensional random walk decouples in the change-point time k' 
and the time index t' and the variance of the increments in these two directions are the 
same and are both equal to 6 2 N/r]. Hence, the random walk along these two coordiates 
are asymptotically independent and it becomes similar to the case studied in CD- Compare 
this with (the equation following equation (A.4) in mi), note that the only difference is 
that here the variance of the increment is proportional to 3/(4r 2 ) instead of r, so we may 
follow a similar chains of calculation as in the proof in Chapter 7 of ns, in m , the final 
result corresponds to modifying the upper and lower limit by changing the window length 
expression to be yj 4/3 and yj 4ic/3. 


B An informal derivation of Theorem [2} EDD 


Recall that U n ^,t is defined in (|6j), let z Uj i = ( y n ,i — M n)/crn • Then for n £ A, z n ,i are i.i.d. 
normal random variables with mean c n i/a n and unit variance, and for n G A c , z n ,i are i.i.d. 
standard normal random variables. Since we may write 


Sl=fc+i(^ k)z n ,i 

/EU+i (i-fc) 2 ' 


For any time t and nGi, we have 


(29) 


1+ fe) S ,2 - l+ te) : 


t(t + 1)(2£ + 1) _ fCn\ t 


2 +3 


= — -+o(t*), (30) 


which grows cubically with respect to time. For the unaffected sensors, n G A c , Eq ^{U 2 0 £ } = 
1. Hence, the value of the detection statistic will be dominated by those affected sensors. 
On the other hand, note that when x is large, 

g(x) = log(l -po +poe x2/2 ) = logp 0 + %- + log ( l - ™ e - x2/2 ') « %- + logp 0 - 

2 V PO / 2 

Then the expectation of the statistic in § can be computed if w is sufficiently large (at 
least larger than the expected detection delay), as follows: 


E ° IT.?? E 9 ( U n,k,t) | ~ (l^llogPO + \ K o{Un, k ,t} + 

V n =1 J y nEA 

At the stopping time, if we ignore of the overshoot of the threshold over b, the value statistic 
is 6. Use Wald’s identify m and if we ignore the overshoot of the statistic over the threshold 
6, we may obtain a first order approximation as b -A oo, by solving 


\A\ logpo + 


2 



(31) 


From Jensen’s inequality, we know that Eq^ITj} > (Ef^jTb}) 3 . Therefore, a first-order 
approximation for the expected detection delay is given by 


( b- JVIogpo —(JV— |^|)E{g(f/)} 

V Z \ 2 /6 


1/3 

+ o(l). 


E £{T 2 } < 


(32) 
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C Proof for Optimality 

Proof (Proof of Theorems 1) The proof starts by a change of measure from Poo to P^). For 
any stopping time T £ Cpy j, we have that for any K~ ; . > 0, C > 0 and e £ (0,1), 

Poo {k < T < k + (1 — e)K 1 \T > k} 

= ex P(—AA,fc,T)|T > fcj 

> E fc l { i {k<T<k+(l-e)K 1 ,\ A k T <C} ex P(-^,J=,T)|'P > &} 


> e 


—C -p. 4 . 


fc<T<fc + (l — e)/f 7 


k<j<k+(l-e)K. 

> e~ c [p£ {T < k + (1 - e)K-y\T > fc} - 


^A,k,j < C 


T > k 


(33) 


max ^ tv 

1<j<(1-e)X 7 


T > fc 


where I{a} ' s the indicator function of any event A, the first equality is Wald’s likelihood 
ratio identity and the last inequality uses the fact that for any event A and B and probability 
measure P , P ( A f~) B) > P(A) — P (B c ). 

From (|33|) we have for any e £ (0,1) 


P£ {T < k + (1 - e)K^\T >k}< pW (T) + 0$(T), 


a(k) / 


(34) 


where 


p^’(T) = e c Poo {T < k + (1 - e)K y \T > k} , 


ti k l(T) =P^ max > C 

1 <J <(1 -e)K. 


T > k> . 


Next, we want to show that both p^l(T) and /3^l(T) converge to zero for any T E C( 7 ) 
and any fc > 0 as 7 goes to infinity. 

First, choosing C = (1 + s)I[(l — e)K 1 ] q ^ then we have 

P'hlfT) = Pfe | [(1 — e)A' 7 ] -9 max X A ,k,k+j > (1 + e)I 

< esssup P^ < [(1 - e)K 1 ]~ q max X Aiktk+j > ( 

l<j<(l-e)X 7 

By the assumption ( | 17| ) , we have 

sup /3~ k l -s> 0. (36) 

0<fc<oo ’ 

Second, by Lemma 6.3.1 in m, we know that for any T E ( 7 ( 7 ) there exists a k > 0, 
possibly depending on 7 , such that 

Poo {T < k + (1 - e)K 1 \T >k}<( 1 - e)tf 7 / 7 . 

Choosing K 7 = (7 — 1 log 7 ) 1 / <? , then we have 

C = (1 + e)/(l - E) q I~ l log 7 = (1 - e 2 )(l - e ) 3-1 logq, 


T > fc 

1 + e)J 


Tk > • 


(35) 


and therefore, 


P$( T ) < 7 (1 " £2)(1 " e) ‘ ! 1 (1 - e)K^h 

= (1 - e )(/ _1 log 7) 1 / 3 7 ( 1 - e2 )( 1 - E )‘ ! " 1 - 1 


-4 0 , 


(37) 


where the last convergence holds since for any q > 1 and e E (0,1) we have (1 — e 2 )(l — 
e ) q — 1 < 1. Therefore, for every e E (0,1) and for any T E C{ 7 ) we have that for some fc > 0, 

{T < k + (1 - £)K 7 |T > fc}-> 0, 

7 —»-oo 


which proves 

Next, to prove \19\ , since 


ESM^(T) > SM^(T) > sup {[(T - fc)+] m |T > k} , 

0 <fc<oo 
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it is suffice to show that for any T G £*( 7 ), 

sup {{(T - k)+] m \T > k] > [/- 1 log 7 ] m /9(l + o(l)) as 7 ^ 0 , (38) 

0</c<oo 

where the residual term o(l) does not depend on T. Using the result EH just proved, we 
can have that for any e G (0,1), there exists some k > 0 such that 

inf -k> (1 — e )(/ _1 log 7 ) 9 It > k\ -> 1 . 

TGC(7) l _ I J 7^-oc 


Therefore, by also Chebyshev inequality, for any e G (0,1) and T G C( 7 ), there exist some 
k > 0 such that 


Efc l {[(T-fc) + ] m |T>fc} 

> [(1 - e ))/- 1 log 7 ) {T - k > (1 - e )(/ _1 log 7 )* |t > fc} ( 39 ) 

> [(1 - log7) m /«] (1 + o(l)), as 7 —>■ 00, 


where the residual term does not depend on T. Since we can arbitrarily choose e G (0,1) 
such that the \S9\ holds, so we have ( |38| ), which completes the proof. 

Proof (Proof of Lemma Qp Rewrite Tcs(b) as 


Tcs(&) = inf 


: max 
0 <k<t 


n ( x ~~ P0 + P0 eX P(*n,k,t)) > ‘ 


(40) 


and define Tsr(6 ) an extended Shiryaev-Roberts (SR) procedure as follows: 

T SR ( 6 ) = inf{f :R t > e b ] , (41) 

where 

t -1 N 

Rt = n (! - PO + PO exp(A n ,k,t)) , t = 1, 2,...; Rq = 0. 

fc = l 71=1 

Clearly, Tcs(b) > Tsr(6 ). Therefore, it is sufficient to show that Tsr(6 ) G ( 7 ( 7 ) if 6 > log 7 . 
Noticing the martingale properties of the likelihood ratios, we have 

Eoo {exp(A nifejt )|^ r t _i} = 1 (42) 

for all n = 1, 2,..., N, t > 0 and 0 < k < t. Moreover, noticing that 


t -2 AT AT 

fit = 53 n ( X “ Po +Po ex P( X n,k,t-l + K,t-l,t)) + fl (1 -po +Poexp(A n! t_i i i)), 

k= 1 71 = 1 71 = 1 

(43) 

then combining |42| we have for all t > 0, 


i —2 AT 

Eoo{Rt|^-l} = En (1 - Po + Po exp(A n)fc)t _i) • 1 ) + 1 

fc= 1 71=1 

—Rt—1 + l. 


(44) 


Therefore, the statistic {Rt — t} t>0 is a (Poo,Rt)-martingale with zero mean. If Eoo {Tgji(b)} 
00 then the theorem is naturally correct, so we only suppose that Eqo {Tsr(6 )} < 00 and 
thus Eoo {Rt sr ( 6 ) — ^Sr(&)} exists. Next, since 0 < Rt < e b on the event (Tsr(^) > £}, we 
have 

liminf / | Rt — t\ dPoo = 0. 

°° J {T S n(b)>t} 

Now we can apply the optional sampling theorem to have Eoo |RT SR ( b ) J = Eoo {Tsr(6 )}. By 
the definition of stopping time Tsr(6 ), we have Rts R (6 ) > e b • Thus, we have Eqo (Rcs(^)} > 
Eoo {^sr(6 )} > e b , which shows that Eoo {Tqs(&)} > 7 if b > log 7 . 


Proof (Proof of Theorem^) First, we notice that if b > log 7 
fioc {fcs(6)} > Eoo {Tcs(fe)} > 7- 


Therefore, by Theorem |3| it is sufficient to show that if 6 > log 7 and b = 0 (log 7 ), then 


711 /q 


ESM£(T C s(b)) < 


log 7 

Ia 


(1 + o(l)) as 7 —> 00. 


(45) 
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Equivalently, it is sufficient to prove that 


/ j \m/q 

ESM£(T C s( 6 )) < ( — J (l + o(l)) as b^oo. 


To start with, we consider a special case when po = 1 in Tcs and denote it by 

N 


TcS 2 ( 6 ) = inf it > 0 : max V' A n k t > b \ . 

I J 


(46) 


Next, we will prove an asymptotical upper bound for the detection delay of Tcsiib). 
Let 

b \ 1/q 

G b = 


U( 1-e) 


(47) 


and then (G b ) q < b/f/^) 1 — e)]. Noticing that under Pj5, we have V= ^A.h.t 
almost surely since the the log-likelihood ratios are 0 for the sensors that are not affected. 
Therefore, by El we can have that for any e E ( 0 , 1 ), t > 0 and some sufficiently large 6 , 


r * 

sup esssup < Y] K,k,k+G b < (G b ) q I A (l 

°<*<* IZYl 



< sup esssup {^A,k, k+Gb <{G b ) q I A {\-e)\T k } 

0 <k<t 


< sup esssup {\ A} k,k+G b < b\Fk} < e- 
0 <k<t 


(48) 


Then, for any k > 0 and integer Z > 1, we can use |48| Z times by conditioning on 
(X n? i,..., X n> k_f.(z 0 _i)G b ) » n = 1 , 2,..., TV for Zo = Z, Z — 1 ,..., 1 in succession (see [ 7 ]) 
to have 

esssup P^ {T CS2 (b) - fc > lG b \T k } 

\ N 

< esssup P^ Y X n,k+(l 0 -l)G b + l,k+l 0 G b ,lo = !,■■■, I 

{n=l 


Tk ^ < e‘. 


(49) 


Therefore, for sufficiently large 6 and any e E (0,1), we have 


ESM m (T CS 2 (b)) < £ {[(* + 1 )G b ] m - ( lG b ) m } ■ 

1=0 

sup esssup P fc {[(T CS2 - k)+] m > ( lG b ) m |-Efc} 

0 <fc<oo 

oo 

< (G i ,) m y][(Z + l) m -Z m ]e i 
1=0 

= (G&) m (l + o(l)) as b —> oo, 


(50) 


where the first inequality can be known directly from the geometric interpretation of ex¬ 
pectation of discrete nonnegative random variables and the last equality holds since for any 
given m> 1 . [(1 + l) m - -+ 1 as Z —>• oo so tha t the radius of convergence is 1. Using 

the fact that ( Gb) rn < [6/7(1 — e )] 771 / 9 we prove ( |46[ ) for the case po = 1. 

Next, we will deal with the case when po E (0,1). Rewrite Tcs 2 (b) as 


TqS 2 ( 6 ) = inf < t 


: max 
0 <k<t 


N log po - 


N 

E- 

n= 1 


> 6 + N log po 


then 


Tcsiib - JVlogpo) = inf jt : ^JVlogpo + ^ j 

Clearly, ESM^(T cs (b)) < ESM^(T CS2 (b - JVlogpo)), and thus 



ESM£(T C s(b)) < 


b — N log po 
~ Ta 


m/q 

(1 + 0 ( 1 )). 


(51) 


Therefore, we can claim that ( |46[ > holds for any fixed po E (0,1] since N and po are constants. 
If b > log 7 and b = 0 (log 7 ), Tcs ( 6 ) belongs to C( 7 ) and ESM^(Tcs) achieves its lower 
bound. 
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Proof (Proof of Corollary] 7]) 

The main steps are almost the same with that in the proof of Theorem The only 
different thing is that we need the condition w 1 > Gb (defined in ( |47| ) in order to make 
< |49[ ) be correct for any k > 0 and any integer l > 1. And the additional assumption \22\ 
ensures this. 


Proof (Proof of Lemma [^[) Consider testing problem Q, then for any k > 0 and j > 1, 


fc+j 


n£A < ^ n i=k-\-1 


Cn(» ~ fc ) 2 

2 


We define, for each n £ A and for all l = 1 ,..., ji, 

^n,l = ~2 \ Cn ^(yn,l+k ~ A 4 j ) ' 


C 2 / 2 

2 


Then we have 


Z=1 nG-4 Z=1 

where we define X n} • 

Under probability measure P^5, we easily know that are independent variables 

which follow normal distribution N((l 2 /2) c 2 , l 2 c 2 ). Other simple computa¬ 

tion tells us that 

E k{( X A^ 2 } <°°’ V * = l.J. 

and under probability measure P*^, 

]TVar' ' ^ 


Z3 


< oo, 


where Var(A) denotes the variance of random variable X. Therefore, combining Kroneckers 
lemma with the Kolmogorov convergence criteria, we have immediately a strong law of large 
numbers which tells us that 

1 x a.s. c n 

.3 ^A,k,k-\-j ~ ” ^ / _j r. 2 ‘ 

7 a j^-oo Gcrt. 

J n£A n 

Finally, we complete the proof by using the fact that all the observations are independent. 

then 


Proof (Proof of Lemma |s|) First, define j/*_ t = i-fcp ’ 


. {^2(6} > toj > Poo | 


N {Vn.tJ 2 


E 


I 0<t<to max(0,t-m^)<fc<to n~l ^ 
> [Poo{l‘ < 2b}] W T t O , 


< b 


(52) 


where Y is a random variable with Xjv distribution. Then, since T 2 ( 6 ) is a non-negative 
discrete random variable, we have 


00 

{t 2 (6)} = ^ Poo{t 2 (6) >t„} 


* 0=0 
00 

> [ p °°{ y < 26}]™ 7 * 0 = 


(53) 


*o=0 

Then if we can choose some b so that 


1 - [Poo{y < 2 6}] 1 


-{Y > 26} < 1 - ( 1 - 


1 / 


, |T 2 ( 6 )| > 7 and thus T 2 (b) E C( 7 ). To choose appropriate threshold 
6 , we need use the tail bound for the distribution. Since \i is sub-exponential with 
parameter (2y/N, 4), it is well known that Poo{T > 2b} < exp(— 2b L N ) if b > A. If we set 


we can claim that E 0 


6 > — -41 
~ 2 


1 - 1 - 


1 / m-y 


then Ti( 6 ) S C( 7 ). 
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Proof (Proof of Theorem^) 

By Lemma [2] we can use Theorem |3| to obtain a lower bound for the detection delays 
of arbitrary procedures in C( 7 ). Specifically, for all m > 1, 


liminf inf ESM^(T) > liminf inf SM^(T) > ^. (54) 

7->-oo TeC( 7) 7->-oo TeC( 7) \ Ia / 

(i) Since Ti(6) is a specified mixture CUSUM procedure for testing problem 0 and the 
observations are independent, the optimality is an immediate corollary from Theorem |4| 

(ii) Since T\(b) is a specified window-limited mixture CUSUM procedure for testing 
problem 0 and the observations are independent, the optimality is an immediate result 
from Corollary [T| 

(iii) The assumption that log w 7 = o(log 7 ) ensures that b > ^—4 log 

and b = 0(log7) can be satisfied simultaneously. Since the observations are independent, 
then ESMf(f 2 (6)) = SM f(T 2 (b)) = E^[f 2 (6)]. The optimality of T 2 (b) is an immediate 
result from Lemma [ 3 ] and the first order approximation of the detection delays in ( |32| >. 






